HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures

Authors

Abstract

The problem of document structure reconstruction refers to converting digital or scanned documents into corresponding semantic structures. Most existing works mainly focus on splitting the boundary of each element in a single document page, neglecting the reconstruction of semantic structure in multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task suitable for the NLP and CV fields. To better evaluate system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations, including categories and relations, obtained from rule-based extractors and human annotators. Moreover, we proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with a soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
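To make the idea of line-level annotations with categories and relations more concrete, the following is a minimal sketch of how such annotations could be assembled into a hierarchy. It is not the HRDoc annotation schema or the DSPS method: the `Unit` and `Node` classes, the `parent` index field, and the category names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One line-level unit (hypothetical schema, not the HRDoc format)."""
    text: str
    category: str  # illustrative label such as "title" or "section"
    parent: int    # index of the parent unit, or -1 for a top-level unit


@dataclass
class Node:
    unit: Unit
    children: list = field(default_factory=list)


def build_tree(units):
    """Attach every unit to its parent and return the top-level nodes."""
    nodes = [Node(u) for u in units]
    roots = []
    for node in nodes:
        if node.unit.parent < 0:
            roots.append(node)
        else:
            nodes[node.unit.parent].children.append(node)
    return roots


def outline(roots, depth=0):
    """Render the hierarchy as indented lines, one per unit."""
    lines = []
    for node in roots:
        lines.append("  " * depth + f"[{node.unit.category}] {node.unit.text}")
        lines.extend(outline(node.children, depth + 1))
    return lines


# Made-up example units; the parent indices refer to positions in this list.
units = [
    Unit("Example Paper Title", "title", -1),
    Unit("1 Introduction", "section", 0),
    Unit("Body text of the introduction", "paragraph", 1),
    Unit("2 Dataset", "section", 0),
]
tree = build_tree(units)
print("\n".join(outline(tree)))
```

In this sketch each unit stores a single parent pointer, so the hierarchy can be rebuilt in one pass over the list. A real system would have to predict the category and the parent of each unit rather than read them from hand-written input.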

Similar resources

Toward a Taxonomy of Logical Document Structures

The automated discovery of logical structure in text documents is an important problem that has recently received a good deal of attention; it can enable the creation of flexible and sophisticated document manipulation tools that will greatly increase the impact of electronic documents. This paper addresses aspects of the nature of the logical structures to be found, in order to develop categorie...

A Hierarchical Document Description and Comparison Method

Determining the similarity of document images is an important first step for several document retrieval tasks, such as document classification, information extraction, and retrieval based on visual similarity. In this paper, we propose a method to describe and compare the content and layout of a document given only an image of the document. A tree structure is used to capture the hierarchical s...

Computation and Nanotechnology: Toward the Fabrication of Complex Hierarchical Structures

Enormous progress has been made in recent years in the nanostructuring of materials, and a variety of techniques are available for fabricating bulk materials with a desired nanostructure. However, the higher levels of organization have been neglected, and nanostructured materials are assembled into macroscopic structures using techniques that are not essentially different from those used for co...

Hierarchical Clustering in Medical Document Collections: the BIC-Means Method

Hierarchical clustering of text collections is a key problem in document management and retrieval. In partitional hierarchical clustering, which is more efficient than its agglomerative counterpart, the entire collection is split into clusters and the individual clusters are further split until a heuristically-motivated termination criterion is met. In this paper, we define the BIC-means algori...

A Level-wise Hierarchical Document Clustering method for Categorization

For document categorization, the many words appearing in similar documents are divided into stopwords and keywords, and, to describe document characteristics precisely, documents are represented by keywords without stopwords. To improve clustering precision, this paper proposes the SHODC algorithm, a seed cluster-based hierarchical document clustering method, and DHODC method through domain stopwr...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i2.25277